Failover clustering is a technique that uses a cluster of SQL Server
instances to protect against failure of the instance currently serving
your users. Failover clustering is based on a hardware solution composed
of multiple servers (known as nodes) that share the same disk resources.
One server is active and owns the database. If that server fails, another
server in the cluster takes over ownership of the database and continues
to serve users.
1. Key Terms
When discussing high
availability, each technique has its own set of key terms. At the
beginning of each section, we will list the terms used for each
solution. Here are some of the terms you need to be familiar with when
setting up a failover cluster:
Node: Server that participates in the failover cluster.
Resource group: Shared set of disks or network resources grouped together to act as a single working unit.
Active node: Node that has ownership of a resource group.
Passive node: Node that is waiting on the active node to fail in order to take ownership of a resource group.
Heartbeat: Health checks sent between nodes to ensure the availability of each node.
Public network: Network used to access the failover cluster from a client computer.
Private network: Network used to send heartbeat messages between nodes.
Quorum: A special resource group that holds information about the nodes, including the name and state of each node.
2. Failover Clustering Overview
You can use failover
clustering to protect an entire instance of SQL Server. Although the
nodes share the same disks or resources, only one server may have
ownership (read and write privileges) of the resource group at any
given time. If a failover occurs, the ownership is transferred to
another node, and SQL Server is back up in the time it takes to bring
the databases back online. The failover usually takes anywhere from a
few seconds to a few minutes, depending on the size of the database and
types of transactions that may have been open during the failure.
For the database to return to a usable state during a failover, it must
go through a Redo phase to roll forward logged transactions and an Undo
phase to roll back any uncommitted transactions. Fast recovery, an
Enterprise Edition feature introduced in SQL Server 2005, allows
applications to access the database as soon as the Redo phase has
completed. Also, since the cluster appears on the network as a single
server, there is no need to redirect applications to a different server
during a failover. The network abstraction combined with fast recovery
makes failing over a fairly quick and unobtrusive process that
ultimately results in less downtime during a failure.
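The two recovery phases described above can be illustrated with a toy model. This is only a sketch: the log record format and the recover function are hypothetical and far simpler than a real SQL Server transaction log. It shows the state of the data after the Redo phase (all logged changes rolled forward) and after the Undo phase (changes from uncommitted transactions rolled back).

```python
# Toy model of crash recovery. The log format is hypothetical:
#   ("update", txn_id, key, old_value, new_value)
#   ("commit", txn_id)

def recover(log):
    """Return (data_after_redo, data_after_undo) for a list of log records."""
    data = {}
    committed = set()

    # Redo phase: roll forward every logged change in log order.
    for record in log:
        if record[0] == "update":
            _, txn, key, old, new = record
            data[key] = new
        elif record[0] == "commit":
            committed.add(record[1])
    after_redo = dict(data)

    # Undo phase: walk the log backward and reverse the changes made by
    # transactions that never committed.
    for record in reversed(log):
        if record[0] == "update" and record[1] not in committed:
            _, txn, key, old, new = record
            if old is None:
                data.pop(key, None)   # the key did not exist before
            else:
                data[key] = old
    return after_redo, data


log = [
    ("update", "T1", "balance_a", None, 100),
    ("commit", "T1"),
    ("update", "T2", "balance_a", 100, 50),   # T2 never commits
    ("update", "T3", "balance_b", None, 20),
    ("commit", "T3"),
]
after_redo, after_undo = recover(log)
print("After Redo:", after_redo)
print("After Undo:", after_undo)
```

In this model, the uncommitted change made by T2 is visible after the Redo phase and is removed only by the Undo phase.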
The number of nodes that can be
added to the cluster depends on the edition of the operating system
(OS) as well as the edition of SQL Server, with a maximum of 16 nodes
using SQL Server 2008 Enterprise Edition running on Windows Server
2008. Failover clustering is supported only in the Enterprise and
Standard Editions of SQL Server. If you are using the Standard Edition,
you are limited to a 2-node cluster. Since failover clustering is also
dependent on the OS, you should be aware of the limitations for each
edition of the OS as well. Windows Server 2008 only supports the use of
failover clustering in its Enterprise, Datacenter, and Itanium
Editions. Windows Server 2008 Enterprise and Datacenter Editions both
support a 16-node cluster, while the Itanium edition only supports an
8-node cluster.
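Because the effective limit is the smaller of the SQL Server limit and the OS limit, the rules above can be summarized in a small lookup. This is a sketch that encodes only the figures stated in this section for SQL Server 2008 on Windows Server 2008; the function and dictionary names are made up for illustration.

```python
# Node limits as stated in the text (SQL Server 2008 on Windows Server 2008).
SQL_EDITION_MAX_NODES = {"Enterprise": 16, "Standard": 2}
OS_EDITION_MAX_NODES = {"Enterprise": 16, "Datacenter": 16, "Itanium": 8}


def max_cluster_nodes(sql_edition, os_edition):
    """Return the largest cluster the given edition pair supports."""
    if sql_edition not in SQL_EDITION_MAX_NODES:
        raise ValueError(f"{sql_edition} does not support failover clustering")
    if os_edition not in OS_EDITION_MAX_NODES:
        raise ValueError(f"{os_edition} does not support failover clustering")
    # The more restrictive of the two limits wins.
    return min(SQL_EDITION_MAX_NODES[sql_edition],
               OS_EDITION_MAX_NODES[os_edition])


print(max_cluster_nodes("Enterprise", "Itanium"))
print(max_cluster_nodes("Standard", "Datacenter"))
```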
In order to spread the
workload of multiple databases across servers, every node in a cluster
can have ownership of its own set of resources and its own instance of
SQL Server. Every server that owns a resource is referred to as an active node, and every server that does not own a resource is referred to as a passive node.
There are two basic types of cluster configurations: a single-node
cluster and a multi-node cluster. A multi-node cluster contains two or
more active nodes, and a single-node cluster contains one active node
with one or more passive nodes. Figure 1 shows a standard single-node cluster configuration commonly referred to as an active/passive configuration.
So if multiple active
nodes allow you to utilize all of your servers, why wouldn't you make
all the servers in the cluster active? The answer: Resource
constraints. In a 2-node cluster, if one of the nodes has a failure,
the other available node will have to process its normal load as well
as the load of the failed node. For this reason, it is considered best
practice to have one passive node per active node in a failover
cluster. Figure 2 shows a healthy multi-node cluster configuration running with only two nodes. If Node 1 has a failure, as demonstrated in Figure 3,
Node 2 is now responsible for both instances of SQL Server. While this
configuration will technically work, if Node 2 does not have the
capacity to handle the workload of both instances, your server may slow
to a crawl, ultimately leading to unhappy users in two systems instead
of just one.
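The capacity concern can be made concrete with a rough check. This sketch assumes, purely for illustration, that each node's workload and capacity can be expressed as a single number (here, a percentage of one server's capacity); real capacity planning involves CPU, memory, and I/O separately. The function names and the sample figures are hypothetical.

```python
def load_after_failover(loads, failed, survivor):
    """Return the survivor's load once it also hosts the failed node's instance."""
    return loads[survivor] + loads[failed]


def can_absorb(loads, capacity, failed, survivor):
    """True if the survivor can carry both workloads within its capacity."""
    return load_after_failover(loads, failed, survivor) <= capacity[survivor]


capacity = {"Node1": 100, "Node2": 100}

# Two active nodes, each already busy: Node 2 cannot absorb Node 1's load.
active_active = {"Node1": 60, "Node2": 55}
print(can_absorb(active_active, capacity, failed="Node1", survivor="Node2"))

# One active and one passive node: the passive node has room to take over.
active_passive = {"Node1": 60, "Node2": 0}
print(can_absorb(active_passive, capacity, failed="Node1", survivor="Node2"))
```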
So, how does all of this
work? A heartbeat signal is sent between the nodes to determine the
availability of one another. If one of the nodes has not received a
message within a given time period or number of retries, a failover is
initiated and the primary failover node takes ownership of the
resources. It is the responsibility of the quorum drive to maintain a
record of the state of each node during this process. Heartbeat checks
are performed at the OS level as well as the SQL Server level. The OS
is in constant contact with the other nodes, checking the health and
availability of the servers. For this reason, a private network is used
for the heartbeat between nodes to decrease the possibility of a
failover occurring due to network-related issues. SQL Server performs
its own checks, known as LooksAlive and IsAlive. LooksAlive is a less intrusive check that runs every 5 seconds to make sure the SQL Server service is running. The IsAlive check runs every 60 seconds and executes the query SELECT @@SERVERNAME against the active node to make sure that SQL Server can respond to incoming requests.
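The timing of these checks, and the idea of failing over after a number of missed heartbeats, can be sketched as follows. The 5-second and 60-second intervals come from the text; everything else is an assumption for illustration, including the threshold of three missed heartbeats and the names checks_due and HeartbeatMonitor. This is not how the cluster service is actually implemented.

```python
LOOKS_ALIVE_INTERVAL = 5   # seconds, from the text
IS_ALIVE_INTERVAL = 60     # seconds, from the text


def checks_due(second):
    """Return the names of the checks that fire at the given elapsed second."""
    due = []
    if second > 0 and second % LOOKS_ALIVE_INTERVAL == 0:
        due.append("LooksAlive")
    if second > 0 and second % IS_ALIVE_INTERVAL == 0:
        due.append("IsAlive")
    return due


class HeartbeatMonitor:
    """Counts consecutive missed heartbeats and flags a failover.

    max_missed is a hypothetical threshold, not a documented default.
    """

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0
        self.failed_over = False

    def record(self, received):
        """Record one heartbeat interval; return True once failover is triggered."""
        if self.failed_over:
            return True
        if received:
            self.missed = 0          # any heartbeat resets the counter
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.failed_over = True
        return self.failed_over


monitor = HeartbeatMonitor(max_missed=3)
heartbeats = [True, False, False, True, False, False, False]
results = [monitor.record(h) for h in heartbeats]
print(results)
```

Note that the two isolated misses early in the sequence do not trigger a failover, because the heartbeat that follows resets the counter.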
3. Implementation
Before you can even
install SQL Server, you have to make sure that you have configured a
solid Windows cluster at the OS and hardware levels. One of the major
pain points when setting up a failover cluster used to be
searching the Hardware Compatibility List (HCL) to ensure that the
implemented hardware solution would be supported. This requirement has
been removed in Windows Server 2008. You can now run the new cluster
validation tool to perform all the required checks that will ensure you
are running on a supported configuration. Not only can the cluster
validation tool be used to confirm the server configuration, it can be
used to troubleshoot issues after setup as well.
From a SQL Server
perspective, the installation process is pretty straightforward. If you
are familiar with installing a failover cluster in SQL Server 2005, be
aware that the process has completely changed in SQL Server 2008. First you run the
installation on one node to create a single-node cluster. Then you run
the installation on each remaining node, choosing the Add Node option
during setup. In order to remove a node from a cluster, you run
setup.exe on the server that needs to be removed and select the Remove
Node option. The new installation process allows for more granular
manageability of the nodes, allowing you to add and remove nodes
without bringing down the entire cluster. The new installation process
also allows you to perform patching and rolling upgrades with minimal
downtime.
Do not be afraid of failover
clustering. It takes a little extra planning up front, but once it has
been successfully implemented, managing a clustered environment is
really not much different from managing any other SQL Server environment. Just as the
SQL instance has been presented to the application in a virtual network
layer, it will be presented to the administrator this way as well. As
long as you use all the correct virtual names when connecting to the
SQL instance, it will just be administration as usual.
4. Pros and Cons of Failover Clustering
As with any other
technology solution, failover clustering brings certain benefits, but
at a cost. Benefits of failover clustering include the following:
The price you pay for these benefits is as follows:
Failover clustering does not protect against disk failure.
No standby database is available for reporting.
Special hardware is required.
Failover clustering is generally expensive to implement.
There is no duplicate data for reporting or disaster recovery.